The Ngram Statistics Package (Text: : NSP) : A Flexible Tool for Identifying Ngrams, Collocations, and Word Associations

نویسندگان

  • Ted Pedersen
  • Satanjeev Banerjee
  • Bridget T. McInnes
  • Saiyam Kohli
  • Mahesh Joshi
  • Ying Liu
چکیده

The Ngram Statistics Package (Text::NSP) is freely available open-source software that identifies ngrams, collocations and word associations in text. It is implemented in Perl and takes advantage of regular expressions to provide very flexible tokenization and to allow for the identification of non-adjacent ngrams. It includes a wide range of measures of association that can be used to identify collocations.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Using Masks, Suffix Array-Based Data Structures And Multidimensional Arrays To Compute Positional Ngram Statistics From Corpora

This paper describes an implementation to compute positional ngram statistics (i.e. Frequency and Mutual Expectation) based on masks, suffix array-based data structures and multidimensional arrays. Positional ngrams are ordered sequences of words that represent continuous or discontinuous substrings of a corpus. In particular, the positional ngram model has shown successful results for the extr...

متن کامل

Ngram2vec: Learning Improved Word Representations from Ngram Co-occurrence Statistics

The existing word representation methods mostly limit their information source to word co-occurrence statistics. In this paper, we introduce ngrams into four representation methods: SGNS, GloVe, PPMI matrix, and its SVD factorization. Comprehensive experiments are conducted on word analogy and similarity tasks. The results show that improved word representations are learned from ngram cooccurre...

متن کامل

An End-to-End Supervised Target-Word Sense Disambiguation System

We present an extensible supervised Target-Word Sense Disambiguation system that leverages upon GATE (General Architecture for Text Engineering), NSP (Ngram Statistics Package) and WEKA (Waikato Environment for Knowledge Analysis) to present an end-toend solution that integrates feature identification, feature extraction, preprocessing and classification.

متن کامل

Retrieving Collocations from Text: Xtract

Natural languages are full of collocations, recurrent combinations of words that co-occur more often than expected by chance and that correspond to arbitrary word usages. Recent work in lexicography indicates that collocations are pervasive in English; apparently, they are common in all types of writing, including both technical and nontechnical genres. Several approaches have been proposed to ...

متن کامل

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

We developed a search tool for ngrams extracted from a very large corpus (the current system uses the entire Wikipedia, which has 1.7 billion tokens). The tool supports queries with an arbitrary number of wildcards and/or specification by a combination of token, POS, chunk (such as NP, VP, PP) and Named Entity (NE). It outputs the matched ngrams with their frequencies as well as all the context...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011